20 research outputs found

    Compression of Structured High-Throughput Sequencing Data

    Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to quickly adapt to the requirements of new sequencing or analysis methods (because they do not support schema evolution), or fail to provide state-of-the-art compression of the datasets. We have devised new approaches to store HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. Building on these new approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational and network burden of collecting, analyzing, and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% of the size of a BAM file with perfect data fidelity. Compared to the previous compression state of the art, these methods reduce dataset size by more than 40% when storing exome, gene expression or DNA methylation datasets. The approaches have been integrated in a comprehensive suite of software tools (http://goby.campagnelab.org) that support common analyses for a range of high-throughput sequencing assays. National Center for Research Resources (U.S.) (Grant UL1 RR024996); Leukemia & Lymphoma Society of America (Translational Research Program Grant LLS 6304-11); National Institute of Mental Health (U.S.) (R01 MH086883)

    GobyWeb: simplified management and analysis of gene expression and DNA methylation sequencing data.

    We present GobyWeb, a web-based system that facilitates the management and analysis of high-throughput sequencing (HTS) projects. The software provides integrated support for a broad set of HTS analyses and offers a simple plugin extension mechanism. Analyses currently supported include quantification of gene expression for messenger and small RNA sequencing, estimation of DNA methylation (i.e., reduced representation bisulfite sequencing and whole genome methyl-seq), and the detection of pathogens in sequenced data. In contrast to previous pipelines developed for the analysis of HTS data, GobyWeb requires significantly less storage space, runs analyses efficiently on a parallel grid, and scales gracefully to process tens or hundreds of multi-gigabyte samples, yet can be used effectively by researchers who are comfortable using a web browser. We conducted performance evaluations of the software and found that it either outperforms or performs similarly to analysis programs developed for specialized analyses of HTS data. We found that most biologists who took a one-hour GobyWeb training session were readily able to analyze RNA-Seq data with state-of-the-art analysis tools. GobyWeb can be obtained at http://gobyweb.campagnelab.org and is freely available for non-commercial use. GobyWeb plugins are distributed as source code and licensed under the open source LGPL3 license to facilitate code inspection, reuse and independent extensions (http://github.com/CampagneLaboratory/gobyweb2-plugins).

    A comprehensive approach to store HTS data during the analysis life-cycle.

    This diagram illustrates how HTS data stored with the approaches described in this manuscript support common analysis steps of a typical HTS study. HTS reads (Tier I) are stored in files ending with the .compact-reads extension. These files can be read directly by alignment programs and facilitate efficient parallelization on compute grids. When the reads are aligned to a reference genome, alignment files are written in sets of two or three files. Files ending with .entries contain alignment entries. Each alignment entry describes how a segment of a read aligns against the reference genome. Files ending in .header contain global information about the reads, the reference genome, and the alignment (see Figure S1A in File S1, http://www.plosone.org/article/info:doi/10.1371/journal.pone.0079871#pone.0079871.s001, for the data schema that precisely describes these data structures). An optional .tmh file stores the identity of the reads that matched the reference so many times that the aligner did not output matches for them. Aligned reads can be sorted with the ‘sort’ Goby tool, producing a .index file with enough information to support fast random access by genomic position. A permutation file (extension .perm) can also be produced to improve compression of sorted files (see Methods). Files in Tier II are stand-alone and can be transferred across the network for visualization (e.g., IGV). Files in Tier III are available for some specific types of analyses that require linking HTS alignments back to primary read data.
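As a rough illustration of the file layout described in this caption, the sketch below groups file names by basename and checks whether an alignment has its required parts. The extensions and their roles come from the caption. The helper names (`ROLES`, `group_by_basename`, `is_complete_alignment`), the assumption that related files share a basename, and the reading of "two or three files" as .entries and .header plus an optional .tmh are illustrative assumptions, not part of the Goby tools.

```python
from collections import defaultdict

# Extensions and roles as described in the caption.
# The dictionary itself is a hypothetical construct for illustration.
ROLES = {
    ".compact-reads": "HTS reads (Tier I)",
    ".entries": "alignment entries",
    ".header": "global information about reads, reference genome, and alignment",
    ".tmh": "optional: reads that matched the reference too many times",
    ".index": "written by the sort tool; random access by genomic position",
    ".perm": "optional permutation file that improves compression of sorted files",
}


def group_by_basename(filenames):
    """Group recognized files by basename (assumes related files share one)."""
    groups = defaultdict(dict)
    for name in filenames:
        for ext, role in ROLES.items():
            if name.endswith(ext):
                groups[name[: -len(ext)]][ext] = role
                break  # unrecognized files are simply skipped
    return dict(groups)


def is_complete_alignment(extensions):
    """Assumed rule: an alignment needs at least .entries and .header."""
    return ".entries" in extensions and ".header" in extensions


files = ["sample1.compact-reads", "sample1.entries", "sample1.header",
         "sample2.entries", "notes.txt"]
groups = group_by_basename(files)
for basename, parts in sorted(groups.items()):
    print(basename, sorted(parts), is_complete_alignment(parts))
```

Here `sample2` is reported as incomplete because its .header file is missing, and `notes.txt` is ignored.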

    Benchmark against BZip2 general compression.

    Storage efficiency is calculated as the ratio of the size of the data compressed with each method (H, H+T or H+T+D) to the size of the BZip2-compressed data, expressed as a percentage. A storage efficiency of 50% indicates that the specific method compressed the dataset to half the size achieved by BZip2 compression. Compression/decompression time ratios measure the time it takes a specific method to compress/decompress a dataset relative to the time it takes the BZip2 compression method for the same dataset. A ratio of 200% indicates that the specific method is twice as slow as BZip2. See Fig. 1 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0079871#pone-0079871-g001) for a description of the H, H+T and H+T+D methods.
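The two ratios defined in this caption can be written out as a short calculation. This is a minimal sketch: the function names and the input numbers are hypothetical and are not taken from the benchmark.

```python
def storage_efficiency(method_bytes, bzip2_bytes):
    """Compressed size of a method as a percentage of the BZip2 compressed size."""
    if bzip2_bytes <= 0:
        raise ValueError("BZip2 compressed size must be positive")
    return 100.0 * method_bytes / bzip2_bytes


def time_ratio(method_seconds, bzip2_seconds):
    """Time taken by a method as a percentage of the time taken by BZip2."""
    if bzip2_seconds <= 0:
        raise ValueError("BZip2 time must be positive")
    return 100.0 * method_seconds / bzip2_seconds


# Hypothetical inputs chosen to reproduce the two examples in the caption.
print(storage_efficiency(500, 1000))  # 50.0: half the size of the BZip2 output
print(time_ratio(20.0, 10.0))         # 200.0: twice as slow as BZip2
```

Lower is better for both measures: values under 100% mean the method produced a smaller file or ran faster than BZip2.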

    Structured Data Compression Techniques.

    We present the techniques that we devised for compressing structured High-Throughput Sequencing (HTS) data. We use a combination of general compression techniques (panel A) and of techniques that take advantage of the information provided by a data schema (B-E). (A) General compression techniques convert structured data to streams of bytes (serialization, typically done one message at a time) and then compress the resulting stream of bytes with a general-purpose compression approach such as Gzip or Bzip2. We use such techniques alone (Gzip and Bzip2 codecs) or in combination with structured data compression (Hybrid codecs, labeled H, H+T or H+T+D according to the technique used). (B) Separate field encoding reorganizes blocks of messages into lists of field values before compressing each field independently. The technique requires compressing blocks of PB messages, or PB chunks. (C) Field Modeling helps compress data by expressing the value of one field as a function of other fields and constants. (D) Template Compression Technique. Here, the data structure is used to detect subsets of messages that repeat in the input messages. Fields that vary frequently are excluded from the template. The template values are stored with the number of template repetitions and the values needed to reconstruct the input messages. (E) Domain Modeling Technique. Alignment messages refer to each other with message links (i.e., references between messages), represented here as pair-link messages with three fields: position, target-index and fragment-index of the linked message. We realized that within a PB chunk, it is possible to remove the three fields representing the link and replace them with an integer index that counts how many messages upstream or downstream the linked message is in the chunk. Links from an entry in a chunk to an entry in another chunk cannot be removed and are stored explicitly with the three original fields.
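Two of the techniques in this caption, separate field encoding (B) and the replacement of in-chunk links by a relative index (E), can be sketched in a few lines. This is an illustrative toy and not the Goby implementation: messages are plain Python dictionaries rather than PB messages, JSON plus the standard `bz2` module stands in for the actual codecs, and all function and key names (`pair_link`, `link_offset`, and so on) are assumptions made for the example.

```python
import bz2
import json

KEY_FIELDS = ("position", "target_index", "fragment_index")


def separate_field_encode(messages, fields):
    """(B) Turn a chunk of messages into one value list per field, each compressed alone."""
    return {
        name: bz2.compress(json.dumps([m[name] for m in messages]).encode("utf-8"))
        for name in fields
    }


def separate_field_decode(columns, fields):
    """Inverse of separate_field_encode: rebuild the messages from the field lists."""
    lists = {n: json.loads(bz2.decompress(columns[n]).decode("utf-8")) for n in fields}
    count = len(lists[fields[0]]) if fields else 0
    return [{n: lists[n][i] for n in fields} for i in range(count)]


def links_to_offsets(chunk):
    """(E) Replace a three-field link by a signed offset when the target is in the chunk."""
    index_of = {tuple(m[k] for k in KEY_FIELDS): i for i, m in enumerate(chunk)}
    encoded = []
    for i, message in enumerate(chunk):
        out = {k: v for k, v in message.items() if k != "pair_link"}
        link = message.get("pair_link")
        if link is not None:
            j = index_of.get(tuple(link[k] for k in KEY_FIELDS))
            if j is not None:
                out["link_offset"] = j - i   # messages downstream (+) or upstream (-)
            else:
                out["pair_link"] = link      # target is in another chunk: keep explicit
        encoded.append(out)
    return encoded


def offsets_to_links(chunk):
    """Inverse of links_to_offsets: restore the three link fields from the offset."""
    decoded = []
    for i, message in enumerate(chunk):
        out = {k: v for k, v in message.items() if k != "link_offset"}
        if "link_offset" in message:
            target = chunk[i + message["link_offset"]]
            out["pair_link"] = {k: target[k] for k in KEY_FIELDS}
        decoded.append(out)
    return decoded


chunk = [
    {"position": 100, "target_index": 0, "fragment_index": 0,
     "pair_link": {"position": 300, "target_index": 0, "fragment_index": 1}},
    {"position": 300, "target_index": 0, "fragment_index": 1,
     "pair_link": {"position": 100, "target_index": 0, "fragment_index": 0}},
    {"position": 500, "target_index": 0, "fragment_index": 0,
     "pair_link": {"position": 9000, "target_index": 2, "fragment_index": 1}},
]
encoded = links_to_offsets(chunk)
plain = [{k: m[k] for k in KEY_FIELDS} for m in chunk]
columns = separate_field_encode(plain, list(KEY_FIELDS))
```

In this example the first two messages link to each other and are stored with offsets +1 and -1, while the third message links outside the chunk and keeps its three explicit fields. Both transformations are lossless: decoding returns the original messages.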

    Performance of spliced alignments with GSNAP and STAR.

    Alignments were performed with GobyWeb and the GSNAP or STAR alignment plugin. The dataset was one 50 bp single-end RNA-Seq sample with about 43 million reads.

    Benchmark against a BAM baseline.

    This table provides compression size ratios calculated for Tier 2 HTS alignments stored with approach H+T+D, with reads (Tier 1+2), or without (Tier 2 only). Reads are stored as PB data compressed with the BZip2 codec (R-BZIP2) or in FASTQ format compressed with BZip2. The method H+T+D is configured to preserve soft-clips and their quality scores.